server: fix system_tokens being erased in kv_cache #6312
MasterYi1024 wants to merge 2 commits into ggml-org:master
Conversation
I do not use the global system prompt, and given the chat completions route, I don't think it is widely used. But if your theory is true, please first show us a test, ideally using the server test framework.
EDIT: I read too quickly; running CI workflows.
And thanks for looking into this. Contributions are really welcome. You can have a look at the
Thanks for your reply :) I haven't read the server test framework code yet. I'll work on it, and I will take a look at the
@phymbert Originally, I implemented that function so that the server can act as a chatbot for multiple clients, with the system prompt remaining immutable so it does not need to be reprocessed for each client. Thus, a client only needs to send its conversation context with the respective user and assistant names provided by
Hi llama.cpp developers :)
I have been reading the code these days, and I think there is a chance that the system tokens may get erased from the KV cache in the server example by this line:
From my limited knowledge of the server code (I may be wrong, there is a lot of code to read, sorry), I think that in the `kv_cache` the `system_tokens` occupy positions 0 up to their length, and the `prompt_tokens` come after the `system_tokens`. So if the code removes the tokens between positions `n_keep` and `n_keep + n_discard` without accounting for that offset, it will remove some of the `system_tokens`, which makes the generation stop working or produce something meaningless.

Below is my test. This problem can only be reproduced with a specific number of tokens; I just ran into it in my daily tests, which is why everything is in Chinese. Sorry again ;p
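To make the layout issue concrete, here is a tiny plain-Python toy model of the cache positions (NOT the actual llama.cpp code; the names `system_tokens`, `n_keep`, and `n_discard` just mirror the server's variables, and the two functions are only an illustration of the missing offset):

```python
def shift_without_offset(cache, n_system, n_keep, n_discard):
    """Buggy shift: removes positions [n_keep, n_keep + n_discard),
    a window that overlaps the system tokens at the front of the cache."""
    return cache[:n_keep] + cache[n_keep + n_discard:]

def shift_with_offset(cache, n_system, n_keep, n_discard):
    """Fixed shift: offsets the removal window by the system-prompt length,
    so positions [n_system + n_keep, n_system + n_keep + n_discard) go."""
    start = n_system + n_keep
    return cache[:start] + cache[start + n_discard:]

# Cache layout: system tokens first, then the slot's prompt/generation tokens.
system_tokens = ["SYS0", "SYS1", "SYS2", "SYS3"]
prompt_tokens = [f"TOK{i}" for i in range(8)]
cache = system_tokens + prompt_tokens

n_keep, n_discard = 2, 3

buggy = shift_without_offset(cache, len(system_tokens), n_keep, n_discard)
fixed = shift_with_offset(cache, len(system_tokens), n_keep, n_discard)

print(buggy)  # SYS2 and SYS3 have been erased from the cache
print(fixed)  # all four system tokens survive; only prompt tokens are dropped
```

In the toy model, the buggy variant erases part of the system prompt exactly as described above, while the offset variant only discards prompt tokens.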
The system prompt is meant to make the assistant summarize some text. I wrote it into a translater.json file and loaded it with the -spf parameter:

```json
{
  "prompt": "Assistant's name is John. # CONTEXT # 我需要你总结概括一段文字。 # OBJECTIVE # 阅读用户发给你的文本,总结概括文本内容,为了用户阅读方便,请始终使用与原文本相同的语言进行总结概括。 # STYLE # 不需要有什么风格。 # TONE # 总结概括。 # AUDIENCE # 任何想要了解一段文字大意的人。 # RESPONSE # 回答应该明确易懂,简洁明了。使用与用户输入相同的语言。",
  "anti_prompt": "User",
  "assistant_name": "Assistant"
}
```

Then I start the server with these parameters; notice that `-c` was commented out, so its value is the default 512. You can also see I'm using a Qwen model on an RTX 4090 card:

With the server running, I call curl with five questions:
And these are the generations:

As you can see, the first time I ask about its system prompt and name, it answers correctly. But after I give it a long text to summarize, it forgets its name and system prompt (in this picture, I only ask about the name):

And here is the generation after applying this PR:



Now you see, it remembers who it is and what it should do!
Summary
I made this change just because I find it a little better than before. I don't really understand the logic of the two parts of the token-shift code in the server example; I tried hard to read through them, but they are still not really clear to me. If there is any documentation on the kv_cache and these two parts of the shift code in the server example, that would be great. Thanks a lot :)
If this change is wrong, feel free to close the PR.
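For what it's worth, here is a tiny plain-Python sketch of my current mental model of the two-step context shift (remove a window of cells, then shift the remaining positions down, roughly what `llama_kv_cache_seq_rm` followed by the position shift does). This is only how I understand it, not the actual implementation, so please correct me if it's wrong:

```python
def context_shift(positions, p0, p1):
    """Toy two-step context shift on a list of cell positions.

    Step 1 (like llama_kv_cache_seq_rm): drop cells whose position
    falls in [p0, p1).
    Step 2 (the position shift): move every remaining cell above the
    hole down by (p1 - p0), so positions are contiguous again.
    """
    n_discard = p1 - p0
    kept = [p for p in positions if not (p0 <= p < p1)]        # step 1
    return [p - n_discard if p >= p1 else p for p in kept]     # step 2

# Ten cells at positions 0..9; discard the window [3, 6).
print(context_shift(list(range(10)), 3, 6))  # positions 0..6, contiguous
```

The point of step 2 is that after the removal the surviving cells are renumbered so the next tokens can be appended at the end of a contiguous range.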
Below is the exact text I sent to the server:
I know the new version of the server would say "context is too long for kv_cache, ...", so you have to use the exact text to reproduce this issue.
Thanks in advance:)